InClass_Ex05 : Build Logistic Regression to identify functional & non-functional water points in Osun state of Nigeria

Author

Yogendra Shinde

1. Objective.

In this exercise, we aim to build a logistic regression model to identify ‘Functional’ & ‘Non-Functional’ water-points in Osun state of Nigeria.

1.1 Input Data Used.

The input data used for this model are:

  1. Osun.rds, which contains the LGA (Local Government Area) boundaries of Osun state. It is an sf polygon data frame.

  2. Osun_wp_sf.rds, which contains the water point data.

1.2 Quick Notes on Logistic Regression.

Fig-1
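As a brief reminder of the standard textbook formulation (the notation below is ours and is not taken from Fig-1): with k explanatory variables x_1, ..., x_k, logistic regression models the log-odds of a water point being functional as a linear function of those variables.

```latex
\[
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
p = \Pr(\text{status} = \text{TRUE} \mid x_1, \dots, x_k)
\]
\[
p = \frac{1}{1 + \exp\!\left(-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)\right)}
\]
```

Under this formulation, exp(beta_j) is the multiplicative change in the odds of a functional water point for a one-unit increase in x_j, holding the other variables fixed.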

2. Load required packages.

This exercise uses the packages listed in Table-1.

Table-1
| #  | Package     | Function |
|----|-------------|----------|
| 1  | sf          | Provides simple features access for R. Mainly used for importing, managing, and processing geospatial data. |
| 2  | tidyverse   | For performing data science tasks such as importing, wrangling and visualizing data. |
| 3  | funModeling | A set of functions for exploratory data analysis, data preparation, and model performance. |
| 4  | blorr       | Tools for building and validating binary logistic regression models. |
| 5  | corrplot    | For creating a graphical display of a correlation matrix. |
| 6  | ggpubr      | For data visualization. |
| 7  | spdep       | Spatial dependence: a collection of functions to create spatial weights matrix objects from polygon contiguities. |
| 8  | skimr       | For exploratory data analysis. |
| 9  | tmap        | For choropleth map creation. |
| 10 | caret       | For building machine learning models. |
| 11 | GWmodel     | For building geographically weighted (GW) models, a particular branch of spatial statistics. |

The following code chunk loads the required packages.

pacman::p_load(sf,tidyverse,funModeling,blorr,corrplot,ggpubr,spdep,GWmodel,
               tmap,skimr,caret)

3. Read Input Files.

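# Forward slashes in paths work on Windows, macOS and Linux alike.
Osun_sf <- read_rds("rds/Osun_wp_sf.rds")  # water point data
Osun <- read_rds("rds/Osun.rds")           # LGA boundaries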
summary(Osun_sf)
     row_id          source             lat_deg         lon_deg     
 Min.   : 49601   Length:4760        Min.   :7.060   Min.   :4.077  
 1st Qu.: 66875   Class :character   1st Qu.:7.513   1st Qu.:4.359  
 Median : 68245   Mode  :character   Median :7.706   Median :4.559  
 Mean   : 68551                      Mean   :7.683   Mean   :4.544  
 3rd Qu.: 69562                      3rd Qu.:7.879   3rd Qu.:4.709  
 Max.   :471319                      Max.   :8.062   Max.   :5.055  
                                                                    
 report_date         status_id         water_source_clean water_source_category
 Length:4760        Length:4760        Length:4760        Length:4760          
 Class :character   Class :character   Class :character   Class :character     
 Mode  :character   Mode  :character   Mode  :character   Mode  :character     
                                                                               
                                                                               
                                                                               
                                                                               
 water_tech_clean   water_tech_category facility_type      clean_country_name
 Length:4760        Length:4760         Length:4760        Length:4760       
 Class :character   Class :character    Class :character   Class :character  
 Mode  :character   Mode  :character    Mode  :character   Mode  :character  
                                                                             
                                                                             
                                                                             
                                                                             
  clean_adm1         clean_adm2         clean_adm3         clean_adm4       
 Length:4760        Length:4760        Length:4760        Length:4760       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  install_year   installer         rehab_year     rehabilitator 
 Min.   :1917   Length:4760        Mode:logical   Mode:logical  
 1st Qu.:2006   Class :character   NA's:4760      NA's:4760     
 Median :2010   Mode  :character                                
 Mean   :2009                                                   
 3rd Qu.:2013                                                   
 Max.   :2015                                                   
 NA's   :1144                                                   
 management_clean   status_clean           pay           
 Length:4760        Length:4760        Length:4760       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
 fecal_coliform_presence fecal_coliform_value subjective_quality
 Length:4760             Min.   : NA          Length:4760       
 Class :character        1st Qu.: NA          Class :character  
 Mode  :character        Median : NA          Mode  :character  
                         Mean   :NaN                            
                         3rd Qu.: NA                            
                         Max.   : NA                            
                         NA's   :4760                           
 activity_id         scheme_id           wpdx_id             notes          
 Length:4760        Length:4760        Length:4760        Length:4760       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   orig_lnk          photo_lnk          country_id          data_lnk        
 Length:4760        Length:4760        Length:4760        Length:4760       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road
 Min.   :    0.014        Min.   :    0.152          Min.   :    0.018        
 1st Qu.:  719.362        1st Qu.:  460.897          1st Qu.:  121.250        
 Median : 2972.784        Median : 2554.255          Median :  521.768        
 Mean   : 5021.526        Mean   : 3750.470          Mean   : 1259.277        
 3rd Qu.: 7314.733        3rd Qu.: 5791.936          3rd Qu.: 1834.418        
 Max.   :26909.862        Max.   :19559.479          Max.   :10966.271        
                                                                              
 distance_to_city   distance_to_town water_point_history rehab_priority   
 Min.   :   53.05   Min.   :   30    Length:4760         Min.   :    0.0  
 1st Qu.: 7930.75   1st Qu.: 6877    Class :character    1st Qu.:    7.0  
 Median :15030.41   Median :12205    Mode  :character    Median :   91.5  
 Mean   :16663.99   Mean   :16727                        Mean   :  489.3  
 3rd Qu.:24255.75   3rd Qu.:27739                        3rd Qu.:  376.2  
 Max.   :47934.34   Max.   :44021                        Max.   :29697.0  
                                                         NA's   :2654     
 water_point_population local_population_1km crucialness_score
 Min.   :    0.0        Min.   :    0        Min.   :0.0001   
 1st Qu.:   14.0        1st Qu.:  176        1st Qu.:0.0655   
 Median :  119.0        Median : 1032        Median :0.1548   
 Mean   :  513.6        Mean   : 2727        Mean   :0.2643   
 3rd Qu.:  433.2        3rd Qu.: 3717        3rd Qu.:0.3510   
 Max.   :29697.0        Max.   :36118        Max.   :1.0000   
 NA's   :4              NA's   :4            NA's   :798      
 pressure_score    usage_capacity    is_urban       days_since_report
 Min.   : 0.0010   Min.   : 300.0   Mode :logical   Min.   :1483     
 1st Qu.: 0.1160   1st Qu.: 300.0   FALSE:2884      1st Qu.:2688     
 Median : 0.4067   Median : 300.0   TRUE :1876      Median :2693     
 Mean   : 1.4634   Mean   : 560.7                   Mean   :2693     
 3rd Qu.: 1.2367   3rd Qu.:1000.0                   3rd Qu.:2700     
 Max.   :93.6900   Max.   :1000.0                   Max.   :4645     
 NA's   :798                                                         
 staleness_score latest_record   location_id      cluster_size  
 Min.   :23.13   Mode:logical   Min.   : 23741   Min.   :1.000  
 1st Qu.:42.70   TRUE:4760      1st Qu.:230639   1st Qu.:1.000  
 Median :42.79                  Median :236200   Median :1.000  
 Mean   :42.80                  Mean   :235865   Mean   :1.053  
 3rd Qu.:42.86                  3rd Qu.:240061   3rd Qu.:1.000  
 Max.   :62.66                  Max.   :267454   Max.   :4.000  
                                                                
 clean_country_id   country_name       water_source        water_tech       
 Length:4760        Length:4760        Length:4760        Length:4760       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   status            adm2               adm3            management       
 Mode :logical   Length:4760        Length:4760        Length:4760       
 FALSE:2118      Class :character   Class :character   Class :character  
 TRUE :2642      Mode  :character   Mode  :character   Mode  :character  
                                                                         
                                                                         
                                                                         
                                                                         
     adm1           New Georeferenced Column lat_deg_original
 Length:4760        Length:4760              Min.   : NA     
 Class :character   Class :character         1st Qu.: NA     
 Mode  :character   Mode  :character         Median : NA     
                                             Mean   :NaN     
                                             3rd Qu.: NA     
                                             Max.   : NA     
                                             NA's   :4760    
 lat_lon_deg        lon_deg_original public_data_source  converted        
 Length:4760        Min.   : NA      Length:4760        Length:4760       
 Class :character   1st Qu.: NA      Class :character   Class :character  
 Mode  :character   Median : NA      Mode  :character   Mode  :character  
                    Mean   :NaN                                           
                    3rd Qu.: NA                                           
                    Max.   : NA                                           
                    NA's   :4760                                          
     count   created_timestamp  updated_timestamp           Geometry   
 Min.   :1   Length:4760        Length:4760        POINT        :4760  
 1st Qu.:1   Class :character   Class :character   epsg:26392   :   0  
 Median :1   Mode  :character   Mode  :character   +proj=tmer...:   0  
 Mean   :1                                                             
 3rd Qu.:1                                                             
 Max.   :1                                                             
                                                                       
   ADM2_EN           ADM2_PCODE          ADM1_EN           ADM1_PCODE       
 Length:4760        Length:4760        Length:4760        Length:4760       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            

Next, we plot a bar chart to understand the distribution of the ‘status’ field of the Osun_sf data frame. Note that the status field takes only two values: TRUE and FALSE.

Osun_sf %>%
  freq(input = "status")
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <https://github.com/pablo14/funModeling/issues>.

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0
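The percentage columns in the freq() output above can be checked by hand. The following is a minimal base R sketch that rebuilds them from the two counts shown; the object names (status_counts, share, freq_tbl) are ours and are not part of funModeling.

```r
# Counts copied from the freq() output above.
status_counts <- c("TRUE" = 2642, "FALSE" = 2118)

# Share of each status value, in percent.
share <- 100 * status_counts / sum(status_counts)

# Rebuild the table: frequency, percentage and cumulative percentage.
freq_tbl <- data.frame(
  status          = names(status_counts),
  frequency       = unname(status_counts),
  percentage      = unname(round(share, 1)),
  cumulative_perc = unname(round(cumsum(share), 1))
)
print(freq_tbl)
```

The two classes are fairly balanced (roughly 55% versus 45%), so no resampling is applied before modelling.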
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(Osun)+
  tm_polygons(alpha = 0.4)+
  tm_shape(Osun_sf)+
  tm_dots(col = "status",
          alpha = 0.6)+
  tm_view(set.zoom.limits = c(9,12))
tmap_mode("plot")
tmap mode set to plotting

4. Exploratory Data Analysis.

Here we use the skim() function to understand how the data is distributed in the Osun_sf data frame.

Here are some important observations -

  1. There are 4760 rows and 75 columns.

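  2. Several fields have roughly 20% or more of their values missing, for example rehab_priority (56% missing), install_year (24%), and crucialness_score and pressure_score (about 17% each). We drop these variables because fields with this much missing data are not suitable for building a sound model, particularly a logistic regression model, where glm() by default discards every row that has a missing value in any predictor.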

Osun_sf %>%
  skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
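The share of missing values per column, which skim() reports as 1 minus complete_rate, can also be computed directly. The sketch below uses a small toy data frame with hypothetical values (it is not the Osun data), and the 20% threshold simply mirrors the observation made earlier.

```r
# Toy data frame standing in for Osun_sf; the values are made up.
toy <- data.frame(
  rehab_priority = c(NA, 7, NA, NA, 90),
  install_year   = c(2006, NA, 2010, 2013, NA),
  usage_capacity = c(300, 300, 1000, 300, 1000)
)

# is.na() gives a logical matrix; column means are the missing shares.
missing_share <- colMeans(is.na(toy))
print(round(missing_share, 2))

# Names of the columns with at least 20% of their values missing.
high_missing <- names(missing_share)[missing_share >= 0.2]
print(high_missing)
```

On an sf object, the geometry column would normally be dropped first (for example with st_drop_geometry()) before applying is.na() in this way.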

We create a clean data frame using the following chunk of code. Note that we exclude rows with missing values in any of the model variables and convert usage_capacity to a factor.

Osun_wp_sf_clean <- Osun_sf %>%
  filter_at(vars(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean),
            all_vars(!is.na(.))) %>%
  mutate(usage_capacity = as.factor(usage_capacity))
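  • Note that Osun_wp_sf_clean contains 4 fewer records than Osun_sf (4756 versus 4760). These are the rows with missing water_point_population and local_population_1km values.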
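The cleaning step can be illustrated on a toy data frame. This is a minimal base R sketch with hypothetical values, using complete.cases() rather than the filter_at() call used in this exercise; the names toy, model_vars and toy_clean are ours.

```r
# Toy data frame; values are made up for illustration.
toy <- data.frame(
  status                 = c(TRUE, FALSE, TRUE, TRUE),
  water_point_population = c(120, NA, 45, 300),
  usage_capacity         = c(300, 1000, 300, 1000),
  rehab_priority         = c(NA, 5, NA, 10)  # not a model variable
)

# Only the variables used in the model decide which rows are kept,
# so the NAs in rehab_priority do not cause rows to be dropped.
model_vars <- c("status", "water_point_population", "usage_capacity")
toy_clean <- toy[complete.cases(toy[, model_vars]), ]

# Treat usage_capacity as a categorical variable.
toy_clean$usage_capacity <- as.factor(toy_clean$usage_capacity)

# Number of rows removed by the filter.
print(nrow(toy) - nrow(toy_clean))
print(levels(toy_clean$usage_capacity))
```

Restricting the check to the model variables is what keeps the loss down to a handful of rows; filtering on every column would remove all rows here, as it would in the Osun data, where several columns are entirely NA.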
summary(Osun_wp_sf_clean)
     row_id          source             lat_deg         lon_deg     
 Min.   : 49601   Length:4756        Min.   :7.060   Min.   :4.077  
 1st Qu.: 66876   Class :character   1st Qu.:7.513   1st Qu.:4.359  
 Median : 68245   Mode  :character   Median :7.706   Median :4.559  
 Mean   : 68551                      Mean   :7.683   Mean   :4.544  
 3rd Qu.: 69562                      3rd Qu.:7.879   3rd Qu.:4.709  
 Max.   :471319                      Max.   :8.062   Max.   :5.055  
                                                                    
 report_date         status_id         water_source_clean water_source_category
 Length:4756        Length:4756        Length:4756        Length:4756          
 Class :character   Class :character   Class :character   Class :character     
 Mode  :character   Mode  :character   Mode  :character   Mode  :character     
                                                                               
                                                                               
                                                                               
                                                                               
 water_tech_clean   water_tech_category facility_type      clean_country_name
 Length:4756        Length:4756         Length:4756        Length:4756       
 Class :character   Class :character    Class :character   Class :character  
 Mode  :character   Mode  :character    Mode  :character   Mode  :character  
                                                                             
                                                                             
                                                                             
                                                                             
  clean_adm1         clean_adm2         clean_adm3         clean_adm4       
 Length:4756        Length:4756        Length:4756        Length:4756       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  install_year   installer         rehab_year     rehabilitator 
 Min.   :1917   Length:4756        Mode:logical   Mode:logical  
 1st Qu.:2006   Class :character   NA's:4756      NA's:4756     
 Median :2010   Mode  :character                                
 Mean   :2009                                                   
 3rd Qu.:2013                                                   
 Max.   :2015                                                   
 NA's   :1143                                                   
 management_clean   status_clean           pay           
 Length:4756        Length:4756        Length:4756       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
 fecal_coliform_presence fecal_coliform_value subjective_quality
 Length:4756             Min.   : NA          Length:4756       
 Class :character        1st Qu.: NA          Class :character  
 Mode  :character        Median : NA          Mode  :character  
                         Mean   :NaN                            
                         3rd Qu.: NA                            
                         Max.   : NA                            
                         NA's   :4756                           
 activity_id         scheme_id           wpdx_id             notes          
 Length:4756        Length:4756        Length:4756        Length:4756       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   orig_lnk          photo_lnk          country_id          data_lnk        
 Length:4756        Length:4756        Length:4756        Length:4756       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road
 Min.   :    0.014        Min.   :    0.152          Min.   :    0.018        
 1st Qu.:  719.362        1st Qu.:  460.503          1st Qu.:  121.334        
 Median : 2968.379        Median : 2554.255          Median :  521.768        
 Mean   : 5021.729        Mean   : 3751.000          Mean   : 1259.650        
 3rd Qu.: 7314.733        3rd Qu.: 5791.936          3rd Qu.: 1834.418        
 Max.   :26909.862        Max.   :19559.479          Max.   :10966.271        
                                                                              
 distance_to_city   distance_to_town water_point_history rehab_priority   
 Min.   :   53.05   Min.   :   30    Length:4756         Min.   :    0.0  
 1st Qu.: 7930.75   1st Qu.: 6877    Class :character    1st Qu.:    7.0  
 Median :15020.40   Median :12215    Mode  :character    Median :   91.5  
 Mean   :16662.78   Mean   :16732                        Mean   :  489.3  
 3rd Qu.:24255.75   3rd Qu.:27746                        3rd Qu.:  376.2  
 Max.   :47934.34   Max.   :44021                        Max.   :29697.0  
                                                         NA's   :2650     
 water_point_population local_population_1km crucialness_score
 Min.   :    0.0        Min.   :    0        Min.   :0.0001   
 1st Qu.:   14.0        1st Qu.:  176        1st Qu.:0.0655   
 Median :  119.0        Median : 1032        Median :0.1548   
 Mean   :  513.6        Mean   : 2727        Mean   :0.2643   
 3rd Qu.:  433.2        3rd Qu.: 3717        3rd Qu.:0.3510   
 Max.   :29697.0        Max.   :36118        Max.   :1.0000   
                                             NA's   :794      
 pressure_score    usage_capacity  is_urban       days_since_report
 Min.   : 0.0010   300 :2986      Mode :logical   Min.   :1483     
 1st Qu.: 0.1160   1000:1770      FALSE:2882      1st Qu.:2688     
 Median : 0.4067                  TRUE :1874      Median :2693     
 Mean   : 1.4634                                  Mean   :2693     
 3rd Qu.: 1.2367                                  3rd Qu.:2700     
 Max.   :93.6900                                  Max.   :4645     
 NA's   :794                                                       
 staleness_score latest_record   location_id      cluster_size  
 Min.   :23.13   Mode:logical   Min.   : 23741   Min.   :1.000  
 1st Qu.:42.70   TRUE:4756      1st Qu.:230639   1st Qu.:1.000  
 Median :42.79                  Median :236199   Median :1.000  
 Mean   :42.80                  Mean   :235865   Mean   :1.053  
 3rd Qu.:42.86                  3rd Qu.:240062   3rd Qu.:1.000  
 Max.   :62.66                  Max.   :267454   Max.   :4.000  
                                                                
 clean_country_id   country_name       water_source        water_tech       
 Length:4756        Length:4756        Length:4756        Length:4756       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   status            adm2               adm3            management       
 Mode :logical   Length:4756        Length:4756        Length:4756       
 FALSE:2114      Class :character   Class :character   Class :character  
 TRUE :2642      Mode  :character   Mode  :character   Mode  :character  
                                                                         
                                                                         
                                                                         
                                                                         
     adm1           New Georeferenced Column lat_deg_original
 Length:4756        Length:4756              Min.   : NA     
 Class :character   Class :character         1st Qu.: NA     
 Mode  :character   Mode  :character         Median : NA     
                                             Mean   :NaN     
                                             3rd Qu.: NA     
                                             Max.   : NA     
                                             NA's   :4756    
 lat_lon_deg        lon_deg_original public_data_source  converted        
 Length:4756        Min.   : NA      Length:4756        Length:4756       
 Class :character   1st Qu.: NA      Class :character   Class :character  
 Mode  :character   Median : NA      Mode  :character   Mode  :character  
                    Mean   :NaN                                           
                    3rd Qu.: NA                                           
                    Max.   : NA                                           
                    NA's   :4756                                          
     count   created_timestamp  updated_timestamp           Geometry   
 Min.   :1   Length:4756        Length:4756        POINT        :4756  
 1st Qu.:1   Class :character   Class :character   epsg:26392   :   0  
 Median :1   Mode  :character   Mode  :character   +proj=tmer...:   0  
 Mean   :1                                                             
 3rd Qu.:1                                                             
 Max.   :1                                                             
                                                                       
   ADM2_EN           ADM2_PCODE          ADM1_EN           ADM1_PCODE       
 Length:4756        Length:4756        Length:4756        Length:4756       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            

5. Correlation Analysis.

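# Select the model variables by column position (order as printed by summary()):
#   7      water_source_clean
#   35:39  distance_to_primary_road ... distance_to_town
#   42:43  water_point_population, local_population_1km
#   46:47  usage_capacity, is_urban
#   57     status
Osun_wp <- Osun_wp_sf_clean %>%
  select(c(7,35:39,42:43,46:47,57)) %>%
  st_set_geometry(NULL) # Drop geometry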

5.1 Correlation Matrix.

# Columns 2:7 of Osun_wp are the five distance variables and water_point_population.
cluster_vars.cor <- cor(Osun_wp[, 2:7])
corrplot.mixed(cluster_vars.cor,
               tl.cex = 0.7,
               lower = "ellipse",
               number.cex = 0.6,
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

Observation -

We observe that none of the variables are highly correlated. As a rule of thumb, we treat an absolute correlation coefficient of 0.8 or more as high, and we would recommend leaving one variable from any such pair out of the regression model.
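The 0.8 rule of thumb can also be applied programmatically instead of by reading the plot. The following is a minimal base R sketch on a toy data frame with hypothetical values, where x2 is deliberately built to track x1 closely; the names toy, cor_mat and high_pairs are ours.

```r
# Toy data: x2 is x1 plus a small alternating offset, x3 is unrelated.
x1 <- 1:10
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rep(c(0.5, -0.5), times = 5),
  x3 = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

cor_mat <- cor(toy)

# Look only above the diagonal so each pair is reported once.
idx <- which(abs(cor_mat) >= 0.8 & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(high_pairs)
```

An empty high_pairs data frame would correspond to the observation above that no pair of variables is highly correlated.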

6. Perform Logistic Regression.

In the code chunk below, we use the glm() function of R to build a logistic regression model for the water point status.

model <- glm(status~ distance_to_primary_road+
               distance_to_secondary_road+
               distance_to_tertiary_road+
               distance_to_city+
               distance_to_town+
               is_urban+
               usage_capacity+
               water_source_clean+
               water_point_population+
               local_population_1km,
             data = Osun_wp_sf_clean,
             family = binomial(link = 'logit'))

Here we use the blorr package to generate the model report.

blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4744           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3887        0.1124      3.4588       5e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7153      0.4744 
       distance_to_secondary_road           1      0.0000        0.0000     -0.5530      0.5802 
       distance_to_tertiary_road            1      1e-04         0.0000      4.6708      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7574      0.0000 
            distance_to_town                1      0.0000        0.0000     -4.9170      0.0000 
              is_urbanTRUE                  1     -0.2971        0.0819     -3.6294       3e-04 
           usage_capacity1000               1     -0.6230        0.0697     -8.9366      0.0000 
water_source_cleanProtected Shallow Well    1      0.5040        0.0857      5.8783      0.0000 
   water_source_cleanProtected Spring       1      1.2882        0.4388      2.9359      0.0033 
         water_point_population             1      -5e-04        0.0000    -11.3686      0.0000 
          local_population_1km              1      3e-04         0.0000     19.2953      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7347          Somers' D        0.4693   
% Discordant          0.2653          Gamma            0.4693   
% Tied                0.0000          Tau-a            0.2318   
Pairs                5585188          c                0.7347   
---------------------------------------------------------------

6.1 Interpretation of the report.

  1. Response Summary tells us that 2114 records belong to class 0 and 2642 records belong to class 1.

  2. At 95% confidence level, variables with p-value less than 0.05 are statistically significant. These are all independent variables except distance_to_primary_road and distance_to_secondary_road.

  3. In the Maximum Likelihood Estimates table, the ‘Estimate’ column gives the logistic regression coefficients on the log-odds scale. These are not correlation coefficients and are not bounded between -1 and +1; for example, the estimate 1.2882 for ‘water_source_cleanProtected Spring’ is a valid coefficient.

    For the categorical variables (is_urban, usage_capacity and water_source_clean), each estimate is the change in the log-odds of status = 1 relative to the reference level of that variable. water_point_population and local_population_1km are continuous variables.

  4. For continuous variables - A positive estimate implies that the log-odds of status = 1 increase as the variable increases, and a negative estimate implies that they decrease. The size of an estimate depends on the units of the variable, so a small value (such as 3e-04 for local_population_1km) does not by itself indicate a weak effect.
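Estimates from a binomial glm() with a logit link are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for the categorical terms only, with the values copied by hand from the Maximum Likelihood Estimates table above; the vector name `log_odds` is just an illustration.

```r
# Estimates copied from the Maximum Likelihood Estimates table above
# (categorical terms only; log-odds scale).
log_odds <- c(
  "is_urbanTRUE"                             = -0.2971,
  "usage_capacity1000"                       = -0.6230,
  "water_source_cleanProtected Shallow Well" =  0.5040,
  "water_source_cleanProtected Spring"       =  1.2882
)

# exp() converts a log-odds coefficient into an odds ratio.
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))
```

An odds ratio above 1 means higher odds of status = 1 than at the reference level, and a value below 1 means lower odds. With the fitted object at hand, exp(coef(model)) would give the same quantities for all terms without copying numbers.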

6.2 Confusion Matrix.

blr_confusion_matrix(model,cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1301  738
         1   813 1904

                Accuracy : 0.6739 
     No Information Rate : 0.4445 

                   Kappa : 0.3373 

McNemars's Test P-Value  : 0.0602 

             Sensitivity : 0.7207 
             Specificity : 0.6154 
          Pos Pred Value : 0.7008 
          Neg Pred Value : 0.6381 
              Prevalence : 0.5555 
          Detection Rate : 0.4003 
    Detection Prevalence : 0.5713 
       Balanced Accuracy : 0.6680 
               Precision : 0.7008 
                  Recall : 0.7207 

        'Positive' Class : 1

6.3 Interpretation of Confusion Matrix.

  1. In order to assess the overall performance of a logistic regression model, we tend to refer to the Misclassification Rate. The classification table above shows that there are 738 false negatives and 813 false positives. The overall misclassification error is (738+813)/4756 = 32.61%.

    According to the Misclassification Rate measure, the model predicts 100 - 32.61 = 67.39% of the water point statuses correctly, which is the accuracy of the model.

  2. Let us understand True Positive Rate and True Negative Rate. See following figure for reference.

  3. Sensitivity is also known as the true positive rate or recall. It answers the question, “If an event really is positive, what is the probability that the model predicts it as positive?”. Our model shows that Sensitivity = 1904/(1904+738) = 72.07%.

  4. Specificity is the true negative rate. It answers the question, “If an event really is negative, what is the probability that the model predicts it as negative?”. Our model shows that Specificity = 1301/(1301+813) = 61.54%.

    Metrics
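The metrics in this section can be recomputed by hand from the four cells of the confusion matrix. The sketch below uses the counts printed by blr_confusion_matrix() above, with class 1 as the positive class; the variable names (`tp`, `fn`, `fp`, `tn`) are introduced here only for illustration.

```r
# Counts read from the blr_confusion_matrix() output above (positive class = 1).
tp <- 1904  # predicted 1, reference TRUE
fn <- 738   # predicted 0, reference TRUE
fp <- 813   # predicted 1, reference FALSE
tn <- 1301  # predicted 0, reference FALSE

total <- tp + fn + fp + tn

metrics <- c(
  accuracy          = (tp + tn) / total,
  misclassification = (fp + fn) / total,
  sensitivity       = tp / (tp + fn),  # share of actual positives found
  specificity       = tn / (tn + fp),  # share of actual negatives found
  precision         = tp / (tp + fp)   # share of predicted positives that are correct
)
print(round(metrics, 4))
```

Note that precision, not sensitivity, is the measure that conditions on the model's prediction, which is why the two are reported separately in the output.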

7. How can we improve performance ?

Though our results are encouraging for a first try, there is still a lot of scope for improvement. Let us convert the simple feature data frame into its sp counterpart, a SpatialPointsDataFrame.

Osun_wp_sp <- Osun_wp_sf_clean %>%
  select(c(status,
           distance_to_primary_road,
           distance_to_secondary_road,
           distance_to_tertiary_road,
           distance_to_city,
           distance_to_town,
           water_point_population,
           local_population_1km,
           is_urban,
           usage_capacity,
           water_source_clean
           )) %>%
  as_Spatial()
#
Osun_wp_sp
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 11
names       : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean 
min values  :      0,        0.014461356813335,          0.152195902540837,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,        0,           1000,           Borehole 
max values  :      1,         26909.8616132094,           19559.4793799085,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,        1,            300,   Protected Spring 

Important Note - Osun_wp_sp now has 4 fewer records: 4756 instead of 4760.

8. Calculate Distance Matrix - Fixed Bandwidth.

bw.fixed <- bw.ggwr(status ~ distance_to_primary_road +
                      distance_to_secondary_road+
                      distance_to_tertiary_road+
                      distance_to_city+
                      distance_to_town+
                      water_point_population+
                      local_population_1km+
                      is_urban+
                      usage_capacity+
                      water_source_clean,
                    data = Osun_wp_sp,
                    family = "binomial",
                    approach = "AIC",
                    kernel = "gaussian",
                    adaptive = FALSE, # for fixed bandwidth
                    longlat = FALSE) # input data have been converted to a projected CRS
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
 Iteration    Log-Likelihood:(With bandwidth:  95768.67 )
=========================
       0        -2889 
       1        -2836 
       2        -2830 
       3        -2829 
       4        -2829 
       5        -2829 
Fixed bandwidth: 95768.67 AICc value: 5684.357 
 Iteration    Log-Likelihood:(With bandwidth:  59200.13 )
=========================
       0        -2875 
       1        -2818 
       2        -2810 
       3        -2808 
       4        -2808 
       5        -2808 
Fixed bandwidth: 59200.13 AICc value: 5646.785 
 Iteration    Log-Likelihood:(With bandwidth:  36599.53 )
=========================
       0        -2847 
       1        -2781 
       2        -2768 
       3        -2765 
       4        -2765 
       5        -2765 
       6        -2765 
Fixed bandwidth: 36599.53 AICc value: 5575.148 
 Iteration    Log-Likelihood:(With bandwidth:  22631.59 )
=========================
       0        -2798 
       1        -2719 
       2        -2698 
       3        -2693 
       4        -2693 
       5        -2693 
       6        -2693 
Fixed bandwidth: 22631.59 AICc value: 5466.883 
 Iteration    Log-Likelihood:(With bandwidth:  13998.93 )
=========================
       0        -2720 
       1        -2622 
       2        -2590 
       3        -2581 
       4        -2580 
       5        -2580 
       6        -2580 
       7        -2580 
Fixed bandwidth: 13998.93 AICc value: 5324.578 
 Iteration    Log-Likelihood:(With bandwidth:  8663.649 )
=========================
       0        -2601 
       1        -2476 
       2        -2431 
       3        -2419 
       4        -2417 
       5        -2417 
       6        -2417 
       7        -2417 
Fixed bandwidth: 8663.649 AICc value: 5163.61 
 Iteration    Log-Likelihood:(With bandwidth:  5366.266 )
=========================
       0        -2436 
       1        -2268 
       2        -2194 
       3        -2167 
       4        -2161 
       5        -2161 
       6        -2161 
       7        -2161 
       8        -2161 
       9        -2161 
Fixed bandwidth: 5366.266 AICc value: 4990.587 
 Iteration    Log-Likelihood:(With bandwidth:  3328.371 )
=========================
       0        -2157 
       1        -1922 
       2        -1802 
       3        -1739 
       4        -1713 
       5        -1713 
Fixed bandwidth: 3328.371 AICc value: 4798.288 
 Iteration    Log-Likelihood:(With bandwidth:  2068.882 )
=========================
       0        -1751 
       1        -1421 
       2        -1238 
       3        -1133 
       4        -1084 
       5        -1084 
Fixed bandwidth: 2068.882 AICc value: 4837.017 
 Iteration    Log-Likelihood:(With bandwidth:  4106.777 )
=========================
       0        -2297 
       1        -2095 
       2        -1997 
       3        -1951 
       4        -1938 
       5        -1936 
       6        -1936 
       7        -1936 
       8        -1936 
Fixed bandwidth: 4106.777 AICc value: 4873.161 
 Iteration    Log-Likelihood:(With bandwidth:  2847.289 )
=========================
       0        -2036 
       1        -1771 
       2        -1633 
       3        -1558 
       4        -1525 
       5        -1525 
Fixed bandwidth: 2847.289 AICc value: 4768.192 
 Iteration    Log-Likelihood:(With bandwidth:  2549.964 )
=========================
       0        -1941 
       1        -1655 
       2        -1503 
       3        -1417 
       4        -1378 
       5        -1378 
Fixed bandwidth: 2549.964 AICc value: 4762.212 
 Iteration    Log-Likelihood:(With bandwidth:  2366.207 )
=========================
       0        -1874 
       1        -1573 
       2        -1410 
       3        -1316 
       4        -1274 
       5        -1274 
Fixed bandwidth: 2366.207 AICc value: 4773.081 
 Iteration    Log-Likelihood:(With bandwidth:  2663.532 )
=========================
       0        -1979 
       1        -1702 
       2        -1555 
       3        -1474 
       4        -1438 
       5        -1438 
Fixed bandwidth: 2663.532 AICc value: 4762.568 
 Iteration    Log-Likelihood:(With bandwidth:  2479.775 )
=========================
       0        -1917 
       1        -1625 
       2        -1468 
       3        -1380 
       4        -1339 
       5        -1339 
Fixed bandwidth: 2479.775 AICc value: 4764.294 
 Iteration    Log-Likelihood:(With bandwidth:  2593.343 )
=========================
       0        -1956 
       1        -1674 
       2        -1523 
       3        -1439 
       4        -1401 
       5        -1401 
Fixed bandwidth: 2593.343 AICc value: 4761.813 
 Iteration    Log-Likelihood:(With bandwidth:  2620.153 )
=========================
       0        -1965 
       1        -1685 
       2        -1536 
       3        -1453 
       4        -1415 
       5        -1415 
Fixed bandwidth: 2620.153 AICc value: 4761.89 
 Iteration    Log-Likelihood:(With bandwidth:  2576.774 )
=========================
       0        -1950 
       1        -1667 
       2        -1515 
       3        -1431 
       4        -1393 
       5        -1393 
Fixed bandwidth: 2576.774 AICc value: 4761.889 
 Iteration    Log-Likelihood:(With bandwidth:  2603.584 )
=========================
       0        -1960 
       1        -1678 
       2        -1528 
       3        -1445 
       4        -1407 
       5        -1407 
Fixed bandwidth: 2603.584 AICc value: 4761.813 
 Iteration    Log-Likelihood:(With bandwidth:  2609.913 )
=========================
       0        -1962 
       1        -1680 
       2        -1531 
       3        -1448 
       4        -1410 
       5        -1410 
Fixed bandwidth: 2609.913 AICc value: 4761.831 
 Iteration    Log-Likelihood:(With bandwidth:  2599.672 )
=========================
       0        -1958 
       1        -1676 
       2        -1526 
       3        -1443 
       4        -1405 
       5        -1405 
Fixed bandwidth: 2599.672 AICc value: 4761.809 
 Iteration    Log-Likelihood:(With bandwidth:  2597.255 )
=========================
       0        -1957 
       1        -1675 
       2        -1525 
       3        -1441 
       4        -1403 
       5        -1403 
Fixed bandwidth: 2597.255 AICc value: 4761.809 
  • AICc - Akaike Information Criterion Corrected value is 4761.809.
bw.fixed
[1] 2599.672
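To get a feel for what a fixed bandwidth of about 2600 m means, the sketch below evaluates a Gaussian kernel at a few distances. It assumes the common Gaussian form w = exp(-0.5 (d / bw)^2); the function `gaussian_weight()` and the example distances are hypothetical and written for illustration, not taken from GWmodel.

```r
# Gaussian kernel: weight decays smoothly with distance d (same units as bw).
gaussian_weight <- function(d, bw) {
  exp(-0.5 * (d / bw)^2)
}

bw <- 2599.672                           # fixed bandwidth reported above, in metres
d  <- c(0, 1000, 2599.672, 5000, 10000)  # hypothetical distances in metres

weights <- gaussian_weight(d, bw)
print(data.frame(distance_m = d, weight = round(weights, 4)))
```

Under this assumption, a water point one bandwidth away from the regression point still contributes with weight exp(-0.5), while points several bandwidths away contribute almost nothing, so each local model is dominated by nearby water points.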

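The selected fixed bandwidth is 2599.672. We feed it into the bw argument of ggwr.basic() of GWmodel in the code chunk below. Note that the formula in this call omits distance_to_tertiary_road, which was included when the bandwidth was calibrated; the report that follows reflects the nine predictors actually used.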

gwlr.fixed <- ggwr.basic(status ~ distance_to_primary_road+
                           distance_to_secondary_road+
                           distance_to_city+
                           distance_to_town+
                           water_point_population+
                           local_population_1km+
                           is_urban+
                           usage_capacity+
                           water_source_clean,
                         data = Osun_wp_sp,
                         bw = bw.fixed,
                         family = "binomial",
                         kernel = "gaussian",
                         adaptive = FALSE,
                         longlat = FALSE)
 Iteration    Log-Likelihood
=========================
       0        -2009 
       1        -1738 
       2        -1595 
       3        -1518 
       4        -1486 
       5        -1486 
gwlr.fixed
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-19 07:10:45 
   Call:
   ggwr.basic(formula = status ~ distance_to_primary_road + distance_to_secondary_road + 
    distance_to_city + distance_to_town + water_point_population + 
    local_population_1km + is_urban + usage_capacity + water_source_clean, 
    data = Osun_wp_sp, bw = bw.fixed, family = "binomial", kernel = "gaussian", 
    adaptive = FALSE, longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_primary_road distance_to_secondary_road distance_to_city distance_to_town water_point_population local_population_1km is_urban usage_capacity water_source_clean
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-115.519    -1.765     1.074     1.756    37.016  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 4.828e-01  1.105e-01   4.370 1.24e-05
distance_to_primary_road                 -9.749e-06  6.388e-06  -1.526 0.127012
distance_to_secondary_road               -7.826e-06  9.264e-06  -0.845 0.398227
distance_to_city                         -1.244e-05  3.399e-06  -3.658 0.000254
distance_to_town                         -1.225e-05  2.952e-06  -4.148 3.35e-05
water_point_population                   -4.952e-04  4.447e-05 -11.136  < 2e-16
local_population_1km                      3.404e-04  1.778e-05  19.142  < 2e-16
is_urbanTRUE                             -3.776e-01  7.990e-02  -4.726 2.29e-06
usage_capacity1000                       -6.547e-01  6.923e-02  -9.458  < 2e-16
water_source_cleanProtected Shallow Well  4.823e-01  8.538e-02   5.649 1.61e-08
water_source_cleanProtected Spring        1.219e+00  4.380e-01   2.783 0.005393
                                            
Intercept                                ***
distance_to_primary_road                    
distance_to_secondary_road                  
distance_to_city                         ***
distance_to_town                         ***
water_point_population                   ***
local_population_1km                     ***
is_urbanTRUE                             ***
usage_capacity1000                       ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5710.2  on 4745  degrees of freedom
AIC: 5732.2

Number of Fisher Scoring iterations: 5


 AICc:  5732.272
 Pseudo R-square value:  0.1261403
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2599.672 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -9.2644e+02 -3.7596e+00  1.9876e+00
   distance_to_primary_road                 -1.4916e-02 -4.5206e-04 -4.6235e-05
   distance_to_secondary_road               -1.0417e-02 -2.7516e-04  8.9005e-05
   distance_to_city                         -3.1672e-02 -6.1472e-04 -1.2605e-04
   distance_to_town                         -3.1309e-02 -5.2199e-04 -1.3036e-04
   water_point_population                   -4.2113e-02 -2.0993e-03 -9.8422e-04
   local_population_1km                     -1.0362e-01  4.4413e-04  9.9757e-04
   is_urbanTRUE                             -4.0082e+02 -4.1620e+00 -1.3661e+00
   usage_capacity1000                       -2.6050e+01 -9.9378e-01 -4.4356e-01
   water_source_cleanProtected.Shallow.Well -1.9221e+02 -3.7680e-01  4.6390e-01
   water_source_cleanProtected.Spring       -3.8607e+02 -5.4021e+00  2.8650e+00
                                                3rd Qu.      Max.
   Intercept                                 1.2246e+01 1656.1754
   distance_to_primary_road                  4.7318e-04    0.0180
   distance_to_secondary_road                5.1996e-04    0.0384
   distance_to_city                          2.0104e-04    0.0159
   distance_to_town                          1.9161e-04    0.0258
   water_point_population                    3.0127e-04    0.1153
   local_population_1km                      1.7451e-03    0.0341
   is_urbanTRUE                              1.3146e+00  753.4661
   usage_capacity1000                        3.0436e-01    7.2031
   water_source_cleanProtected.Shallow.Well  1.7250e+00   21.4072
   water_source_cleanProtected.Spring        8.1053e+00  345.9817
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 2960.796 
   AIC : 4461.633 
   AICc : 4743.249 
   Pseudo R-square value:  0.5468963 

   ***********************************************************************
   Program stops at: 2022-12-19 07:11:13 
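The two pseudo R-square values in the report above appear to be consistent with a deviance-based definition, 1 - (model deviance / null deviance). This is an assumption checked only against the printed numbers, not a statement about GWmodel internals. The sketch below redoes the arithmetic with the rounded deviances copied from the report.

```r
# Deviances copied from the ggwr.basic() report above.
null_deviance     <- 6534.5    # intercept-only model
residual_deviance <- 5710.2    # global logistic regression
gw_deviance       <- 2960.796  # geographically weighted model

# Deviance-based pseudo R-square: 1 - (model deviance / null deviance).
pseudo_r2_global <- 1 - residual_deviance / null_deviance
pseudo_r2_gw     <- 1 - gw_deviance / null_deviance

print(round(c(global = pseudo_r2_global, gw = pseudo_r2_gw), 4))
```

Small differences from the reported values in the later decimal places are expected because the deviances used here are rounded.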

To assess the performance of the gwLR model, we first convert the SDF object into a data frame using the code chunk below.

gwr.fixed <- as.data.frame(gwlr.fixed$SDF)

Now, we will label yhat (predicted) values as

  • if yhat >= 0.5 then 1 and

  • if yhat < 0.5 then 0

gwr.fixed <- gwr.fixed %>%
  mutate(most = yhat >= 0.5) # TRUE when the predicted probability is at least 0.5
freq(gwr.fixed$y)
Warning in freq(gwr.fixed$y): All input values are NA.
NULL
freq(gwr.fixed$most)
Warning in freq(gwr.fixed$most): All input values are NA.
NULL
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data = gwr.fixed$most,
                      reference= gwr.fixed$y,
                      positive = "TRUE" )
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1794  289
     TRUE    320 2353
                                          
               Accuracy : 0.872           
                 95% CI : (0.8621, 0.8813)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7403          
                                          
 Mcnemar's Test P-Value : 0.2241          
                                          
            Sensitivity : 0.8906          
            Specificity : 0.8486          
         Pos Pred Value : 0.8803          
         Neg Pred Value : 0.8613          
             Prevalence : 0.5555          
         Detection Rate : 0.4947          
   Detection Prevalence : 0.5620          
      Balanced Accuracy : 0.8696          
                                          
       'Positive' Class : TRUE            
                                          

We have used argument positive = “TRUE”.

Accuracy = 87.2% and

Sensitivity = 89.06% and

Specificity = 84.86 %

Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
  select(c(ADM2_EN, ADM2_PCODE,ADM1_EN,ADM1_PCODE, status))

Now let us append the gwr.fixed data frame to Osun_wp_sf_selected to produce an output simple feature object called gwr_sf.fixed using the cbind() function.

gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed)
tmap_mode("view")
tmap mode set to interactive viewing
actual <- tm_shape(Osun) +
  tmap_options(check.and.fix = TRUE) +
  tm_polygons(alpha = 0.4) +
  tm_shape(Osun_sf) +
  tm_dots(col = "status",
          alpha = 0.6,
          palette = "YlOrRd") +
  tm_view(set.zoom.limits = c(8, 12))

prob_T <- tm_shape(Osun) +
  tm_polygons(alpha = 0.4) +
  tm_shape(gwr_sf.fixed) + 
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(8, 12))

tmap_arrange(actual, prob_T, 
             asp = 1, ncol = 2, sync = TRUE)

We see that the predictions are largely aligned with the actual status of the water points.

9. Visualizing Co-efficient Estimates.

The code chunk below is used to create an interactive point symbol map.

Recall that yhat is the predicted value of the dependent variable Y.

tmap_mode("view")
tmap mode set to interactive viewing
prob_T <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(8.5, 14))
#
prob_T
tmap_mode("plot")
tmap mode set to plotting

10. Employing Only Statistically Significant Variables in Global and gwLR Models.

10.1 Drop statistically non-significant variables.

As we saw earlier, 2 of the 10 variables, distance_to_primary_road and distance_to_secondary_road, are not statistically significant (p-values > 0.05), so we build logistic regression models without these 2 variables.

Hence, we repeat the relevant steps above to replicate the model building, assessment and visualisation process in the following code chunks, starting with constructing the model with only the 8 statistically significant variables.

model_refined <- glm(status ~ distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sp,
             family = binomial(link = "logit"))

blr_regress(model_refined)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4746           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3540        0.1055      3.3541       8e-04 
       distance_to_tertiary_road            1      1e-04         0.0000      4.9096      0.0000 
            distance_to_city                1      0.0000        0.0000     -5.2022      0.0000 
            distance_to_town                1      0.0000        0.0000     -5.4660      0.0000 
              is_urbanTRUE                  1     -0.2667        0.0747     -3.5690       4e-04 
           usage_capacity1000               1     -0.6206        0.0697     -8.9081      0.0000 
water_source_cleanProtected Shallow Well    1      0.4947        0.0850      5.8228      0.0000 
   water_source_cleanProtected Spring       1      1.2790        0.4384      2.9174      0.0035 
         water_point_population             1      -5e-04        0.0000    -11.3902      0.0000 
          local_population_1km              1      3e-04         0.0000     19.4069      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7349          Somers' D        0.4697   
% Discordant          0.2651          Gamma            0.4697   
% Tied                0.0000          Tau-a            0.2320   
Pairs                5585188          c                0.7349   
---------------------------------------------------------------

We check and see that the remaining variables are all statistically significant in the logistic regression model (p-values < 0.05).
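The same check can be done without reading the table by eye, by pulling the p-values out of the fitted glm object. The sketch below is a minimal illustration on simulated data (the variables `x1`, `x2`, `y` and the object `fit` are hypothetical and unrelated to the water point data).

```r
# Hypothetical data: y depends on x1 but not on x2.
set.seed(1)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.3 + 1.5 * x1))

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))

# Column "Pr(>|z|)" of the coefficient table holds the Wald-test p-values.
p_values    <- coef(summary(fit))[, "Pr(>|z|)"]
significant <- names(p_values)[p_values < 0.05]
print(significant)
```

Applied to model_refined, the same two lines would list the terms with p-values below 0.05, which can then be compared against the full set of terms in the model.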

The code chunk below calculates and displays the confusion matrix for the refined model. We will discuss the results together with that for the refined gwLR model in the subsequent subsection.

blr_confusion_matrix(model_refined, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1300  743
         1   814 1899

                Accuracy : 0.6726 
     No Information Rate : 0.4445 

                   Kappa : 0.3348 

McNemars's Test P-Value  : 0.0761 

             Sensitivity : 0.7188 
             Specificity : 0.6149 
          Pos Pred Value : 0.7000 
          Neg Pred Value : 0.6363 
              Prevalence : 0.5555 
          Detection Rate : 0.3993 
    Detection Prevalence : 0.5704 
       Balanced Accuracy : 0.6669 
               Precision : 0.7000 
                  Recall : 0.7188 

        'Positive' Class : 1

10.2 Determining Fixed Bandwidth for GWR Model.

bw.fixed_refined <- bw.ggwr(status ~ distance_to_tertiary_road +
                      distance_to_city +
                      distance_to_town +
                      is_urban +
                      usage_capacity +
                      water_source_clean +
                      water_point_population +
                      local_population_1km,
                      data = Osun_wp_sp,
                    family = "binomial",
                    approach  = "AIC",
                    kernel = "gaussian",
                    adaptive = FALSE, # for fixed bandwidth
                    longlat = FALSE) # input data have been converted to projected CRS
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
 Iteration    Log-Likelihood:(With bandwidth:  95768.67 )
=========================
       0        -2890 
       1        -2837 
       2        -2830 
       3        -2829 
       4        -2829 
       5        -2829 
Fixed bandwidth: 95768.67 AICc value: 5681.18 
 Iteration    Log-Likelihood:(With bandwidth:  59200.13 )
=========================
       0        -2878 
       1        -2820 
       2        -2812 
       3        -2810 
       4        -2810 
       5        -2810 
Fixed bandwidth: 59200.13 AICc value: 5645.901 
 Iteration    Log-Likelihood:(With bandwidth:  36599.53 )
=========================
       0        -2854 
       1        -2790 
       2        -2777 
       3        -2774 
       4        -2774 
       5        -2774 
       6        -2774 
Fixed bandwidth: 36599.53 AICc value: 5585.354 
 Iteration    Log-Likelihood:(With bandwidth:  22631.59 )
=========================
       0        -2810 
       1        -2732 
       2        -2711 
       3        -2707 
       4        -2707 
       5        -2707 
       6        -2707 
Fixed bandwidth: 22631.59 AICc value: 5481.877 
 Iteration    Log-Likelihood:(With bandwidth:  13998.93 )
=========================
       0        -2732 
       1        -2635 
       2        -2604 
       3        -2597 
       4        -2596 
       5        -2596 
       6        -2596 
Fixed bandwidth: 13998.93 AICc value: 5333.718 
 Iteration    Log-Likelihood:(With bandwidth:  8663.649 )
=========================
       0        -2624 
       1        -2502 
       2        -2459 
       3        -2447 
       4        -2446 
       5        -2446 
       6        -2446 
       7        -2446 
Fixed bandwidth: 8663.649 AICc value: 5178.493 
 Iteration    Log-Likelihood:(With bandwidth:  5366.266 )
=========================
       0        -2478 
       1        -2319 
       2        -2250 
       3        -2225 
       4        -2219 
       5        -2219 
       6        -2220 
       7        -2220 
       8        -2220 
       9        -2220 
Fixed bandwidth: 5366.266 AICc value: 5022.016 
 Iteration    Log-Likelihood:(With bandwidth:  3328.371 )
=========================
       0        -2222 
       1        -2002 
       2        -1894 
       3        -1838 
       4        -1818 
       5        -1814 
       6        -1814 
Fixed bandwidth: 3328.371 AICc value: 4827.587 
 Iteration    Log-Likelihood:(With bandwidth:  2068.882 )
=========================
       0        -1837 
       1        -1528 
       2        -1357 
       3        -1261 
       4        -1222 
       5        -1222 
Fixed bandwidth: 2068.882 AICc value: 4772.046 
 Iteration    Log-Likelihood:(With bandwidth:  1290.476 )
=========================
       0        -1403 
       1        -1016 
       2       -807.3 
       3       -680.2 
       4       -680.2 
Fixed bandwidth: 1290.476 AICc value: 5809.719 
 Iteration    Log-Likelihood:(With bandwidth:  2549.964 )
=========================
       0        -2019 
       1        -1753 
       2        -1614 
       3        -1538 
       4        -1506 
       5        -1506 
Fixed bandwidth: 2549.964 AICc value: 4764.056 
 Iteration    Log-Likelihood:(With bandwidth:  2847.289 )
=========================
       0        -2108 
       1        -1862 
       2        -1736 
       3        -1670 
       4        -1644 
       5        -1644 
Fixed bandwidth: 2847.289 AICc value: 4791.834 
 Iteration    Log-Likelihood:(With bandwidth:  2366.207 )
=========================
       0        -1955 
       1        -1675 
       2        -1525 
       3        -1441 
       4        -1407 
       5        -1407 
Fixed bandwidth: 2366.207 AICc value: 4755.524 
 Iteration    Log-Likelihood:(With bandwidth:  2252.639 )
=========================
       0        -1913 
       1        -1623 
       2        -1465 
       3        -1376 
       4        -1341 
       5        -1341 
Fixed bandwidth: 2252.639 AICc value: 4759.188 
 Iteration    Log-Likelihood:(With bandwidth:  2436.396 )
=========================
       0        -1980 
       1        -1706 
       2        -1560 
       3        -1479 
       4        -1446 
       5        -1446 
Fixed bandwidth: 2436.396 AICc value: 4756.675 
 Iteration    Log-Likelihood:(With bandwidth:  2322.828 )
=========================
       0        -1940 
       1        -1656 
       2        -1503 
       3        -1417 
       4        -1382 
       5        -1382 
Fixed bandwidth: 2322.828 AICc value: 4756.471 
 Iteration    Log-Likelihood:(With bandwidth:  2393.017 )
=========================
       0        -1965 
       1        -1687 
       2        -1539 
       3        -1456 
       4        -1422 
       5        -1422 
Fixed bandwidth: 2393.017 AICc value: 4755.57 
 Iteration    Log-Likelihood:(With bandwidth:  2349.638 )
=========================
       0        -1949 
       1        -1668 
       2        -1517 
       3        -1432 
       4        -1398 
       5        -1398 
Fixed bandwidth: 2349.638 AICc value: 4755.753 
 Iteration    Log-Likelihood:(With bandwidth:  2376.448 )
=========================
       0        -1959 
       1        -1680 
       2        -1530 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2376.448 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2382.777 )
=========================
       0        -1961 
       1        -1683 
       2        -1534 
       3        -1450 
       4        -1416 
       5        -1416 
Fixed bandwidth: 2382.777 AICc value: 4755.491 
 Iteration    Log-Likelihood:(With bandwidth:  2372.536 )
=========================
       0        -1958 
       1        -1678 
       2        -1528 
       3        -1445 
       4        -1411 
       5        -1411 
Fixed bandwidth: 2372.536 AICc value: 4755.488 
 Iteration    Log-Likelihood:(With bandwidth:  2378.865 )
=========================
       0        -1960 
       1        -1681 
       2        -1532 
       3        -1448 
       4        -1414 
       5        -1414 
Fixed bandwidth: 2378.865 AICc value: 4755.481 
 Iteration    Log-Likelihood:(With bandwidth:  2374.954 )
=========================
       0        -1959 
       1        -1679 
       2        -1530 
       3        -1446 
       4        -1412 
       5        -1412 
Fixed bandwidth: 2374.954 AICc value: 4755.482 
 Iteration    Log-Likelihood:(With bandwidth:  2377.371 )
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2377.371 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2377.942 )
=========================
       0        -1960 
       1        -1680 
       2        -1531 
       3        -1448 
       4        -1414 
       5        -1414 
Fixed bandwidth: 2377.942 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2377.018 )
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2377.018 AICc value: 4755.48 
bw.fixed_refined
[1] 2377.371

The search converges on a fixed bandwidth of 2377.371 (AICc 4755.48), which is stored in bw.fixed_refined. We will use this optimal fixed distance value for model assessment in the next subsection.
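
To see how flat the AICc curve is around the selected bandwidth, the last few (bandwidth, AICc) pairs from the log above can be tabulated. This is a minimal base R sketch: the object names `aicc_trace` and `near_best` and the 0.01 tolerance are illustrative assumptions, and the values are copied by hand from the printed output rather than extracted from the fitted object.

```r
# (bandwidth, AICc) pairs copied by hand from the bw.ggwr log above
aicc_trace <- data.frame(
  bandwidth = c(2366.207, 2372.536, 2374.954, 2376.448, 2377.018,
                2377.371, 2377.942, 2378.865, 2382.777, 2393.017),
  aicc      = c(4755.524, 4755.488, 4755.482, 4755.480, 4755.480,
                4755.480, 4755.480, 4755.481, 4755.491, 4755.570)
)

# Smallest AICc seen in this part of the search
best_aicc <- min(aicc_trace$aicc)

# Bandwidths whose AICc is within 0.01 of the minimum (illustrative tolerance)
near_best <- aicc_trace[aicc_trace$aicc - best_aicc <= 0.01, ]

print(near_best)
print(range(near_best$bandwidth))
```

Several candidate bandwidths on either side of 2377.371 give an AICc that differs only in the third decimal place, so the selected value should be read as the centre of a flat region rather than a sharply defined optimum.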

10.3 Model Assessment.

gwlr.fixed_refined <- ggwr.basic(status ~ distance_to_tertiary_road +
                                   distance_to_city +
                                   distance_to_town +
                                   is_urban +
                                   usage_capacity +
                                   water_source_clean +
                                   water_point_population +
                                   local_population_1km,
                                 data = Osun_wp_sp,
                                 bw = 2377.371, # optimal fixed bandwidth from bw.fixed_refined
                                 family = "binomial",
                                 kernel = "gaussian",
                                 adaptive = FALSE, # for fixed bandwidth
                                 longlat = FALSE) # input data have been converted to projected CRS
 Iteration    Log-Likelihood
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 

Note that we use the cleaned version of the water point sf data frame (the 4 water points with missing values are excluded) so that its geometries are consistent with those used in model building.
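
The `bw`, `kernel = "gaussian"` and `adaptive = FALSE` arguments in the `ggwr.basic()` call above control how strongly each neighbouring water point influences a local fit. The sketch below assumes the Gaussian kernel has the form exp(-0.5 * (d / bw)^2), which is how the GWmodel documentation describes it; the helper `gauss_weight()` and the chosen distances are hypothetical and only serve to illustrate the decay.

```r
# Gaussian kernel weight for a fixed bandwidth (illustrative helper, base R only)
gauss_weight <- function(d, bw) exp(-0.5 * (d / bw)^2)

bw <- 2377.371                      # fixed bandwidth reported above
d  <- c(0, 1000, bw, 5000, 10000)   # example distances, same units as bw

weights <- gauss_weight(d, bw)

# Weight assigned to an observation at each distance from the regression point
print(data.frame(distance = d, weight = round(weights, 4)))
```

Under this assumption an observation at the regression point gets weight 1, one exactly a bandwidth away gets exp(-0.5), roughly 0.61, and the weight shrinks towards zero beyond that without ever being cut off.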

10.4 Building Fixed Bandwidth GWR Model.

bw.fixed <- bw.ggwr(status ~ distance_to_primary_road +
                      distance_to_secondary_road +
                      distance_to_tertiary_road +
                      distance_to_city +
                      distance_to_town +
                      is_urban +
                      usage_capacity +
                      water_source_clean +
                      water_point_population +
                      local_population_1km,
                    data = Osun_wp_sp,
                    family = "binomial",
                    approach  = "AIC",
                    kernel = "gaussian",
                    adaptive = FALSE, # for fixed bandwidth
                    longlat = FALSE) # input data have been converted to projected CRS
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
 Iteration    Log-Likelihood:(With bandwidth:  95768.67 )
=========================
       0        -2889 
       1        -2836 
       2        -2830 
       3        -2829 
       4        -2829 
       5        -2829 
Fixed bandwidth: 95768.67 AICc value: 5684.357 
 Iteration    Log-Likelihood:(With bandwidth:  59200.13 )
=========================
       0        -2875 
       1        -2818 
       2        -2810 
       3        -2808 
       4        -2808 
       5        -2808 
Fixed bandwidth: 59200.13 AICc value: 5646.785 
 Iteration    Log-Likelihood:(With bandwidth:  36599.53 )
=========================
       0        -2847 
       1        -2781 
       2        -2768 
       3        -2765 
       4        -2765 
       5        -2765 
       6        -2765 
Fixed bandwidth: 36599.53 AICc value: 5575.148 
 Iteration    Log-Likelihood:(With bandwidth:  22631.59 )
=========================
       0        -2798 
       1        -2719 
       2        -2698 
       3        -2693 
       4        -2693 
       5        -2693 
       6        -2693 
Fixed bandwidth: 22631.59 AICc value: 5466.883 
 Iteration    Log-Likelihood:(With bandwidth:  13998.93 )
=========================
       0        -2720 
       1        -2622 
       2        -2590 
       3        -2581 
       4        -2580 
       5        -2580 
       6        -2580 
       7        -2580 
Fixed bandwidth: 13998.93 AICc value: 5324.578 
 Iteration    Log-Likelihood:(With bandwidth:  8663.649 )
=========================
       0        -2601 
       1        -2476 
       2        -2431 
       3        -2419 
       4        -2417 
       5        -2417 
       6        -2417 
       7        -2417 
Fixed bandwidth: 8663.649 AICc value: 5163.61 
 Iteration    Log-Likelihood:(With bandwidth:  5366.266 )
=========================
       0        -2436 
       1        -2268 
       2        -2194 
       3        -2167 
       4        -2161 
       5        -2161 
       6        -2161 
       7        -2161 
       8        -2161 
       9        -2161 
Fixed bandwidth: 5366.266 AICc value: 4990.587 
 Iteration    Log-Likelihood:(With bandwidth:  3328.371 )
=========================
       0        -2157 
       1        -1922 
       2        -1802 
       3        -1739 
       4        -1713 
       5        -1713 
Fixed bandwidth: 3328.371 AICc value: 4798.288 
 Iteration    Log-Likelihood:(With bandwidth:  2068.882 )
=========================
       0        -1751 
       1        -1421 
       2        -1238 
       3        -1133 
       4        -1084 
       5        -1084 
Fixed bandwidth: 2068.882 AICc value: 4837.017 
 Iteration    Log-Likelihood:(With bandwidth:  4106.777 )
=========================
       0        -2297 
       1        -2095 
       2        -1997 
       3        -1951 
       4        -1938 
       5        -1936 
       6        -1936 
       7        -1936 
       8        -1936 
Fixed bandwidth: 4106.777 AICc value: 4873.161 
 Iteration    Log-Likelihood:(With bandwidth:  2847.289 )
=========================
       0        -2036 
       1        -1771 
       2        -1633 
       3        -1558 
       4        -1525 
       5        -1525 
Fixed bandwidth: 2847.289 AICc value: 4768.192 
 Iteration    Log-Likelihood:(With bandwidth:  2549.964 )
=========================
       0        -1941 
       1        -1655 
       2        -1503 
       3        -1417 
       4        -1378 
       5        -1378 
Fixed bandwidth: 2549.964 AICc value: 4762.212 
 Iteration    Log-Likelihood:(With bandwidth:  2366.207 )
=========================
       0        -1874 
       1        -1573 
       2        -1410 
       3        -1316 
       4        -1274 
       5        -1274 
Fixed bandwidth: 2366.207 AICc value: 4773.081 
 Iteration    Log-Likelihood:(With bandwidth:  2663.532 )
=========================
       0        -1979 
       1        -1702 
       2        -1555 
       3        -1474 
       4        -1438 
       5        -1438 
Fixed bandwidth: 2663.532 AICc value: 4762.568 
 Iteration    Log-Likelihood:(With bandwidth:  2479.775 )
=========================
       0        -1917 
       1        -1625 
       2        -1468 
       3        -1380 
       4        -1339 
       5        -1339 
Fixed bandwidth: 2479.775 AICc value: 4764.294 
 Iteration    Log-Likelihood:(With bandwidth:  2593.343 )
=========================
       0        -1956 
       1        -1674 
       2        -1523 
       3        -1439 
       4        -1401 
       5        -1401 
Fixed bandwidth: 2593.343 AICc value: 4761.813 
 Iteration    Log-Likelihood:(With bandwidth:  2620.153 )
=========================
       0        -1965 
       1        -1685 
       2        -1536 
       3        -1453 
       4        -1415 
       5        -1415 
Fixed bandwidth: 2620.153 AICc value: 4761.89 
 Iteration    Log-Likelihood:(With bandwidth:  2576.774 )
=========================
       0        -1950 
       1        -1667 
       2        -1515 
       3        -1431 
       4        -1393 
       5        -1393 
Fixed bandwidth: 2576.774 AICc value: 4761.889 
 Iteration    Log-Likelihood:(With bandwidth:  2603.584 )
=========================
       0        -1960 
       1        -1678 
       2        -1528 
       3        -1445 
       4        -1407 
       5        -1407 
Fixed bandwidth: 2603.584 AICc value: 4761.813 
 Iteration    Log-Likelihood:(With bandwidth:  2609.913 )
=========================
       0        -1962 
       1        -1680 
       2        -1531 
       3        -1448 
       4        -1410 
       5        -1410 
Fixed bandwidth: 2609.913 AICc value: 4761.831 
 Iteration    Log-Likelihood:(With bandwidth:  2599.672 )
=========================
       0        -1958 
       1        -1676 
       2        -1526 
       3        -1443 
       4        -1405 
       5        -1405 
Fixed bandwidth: 2599.672 AICc value: 4761.809 
 Iteration    Log-Likelihood:(With bandwidth:  2597.255 )
=========================
       0        -1957 
       1        -1675 
       2        -1525 
       3        -1441 
       4        -1403 
       5        -1403 
Fixed bandwidth: 2597.255 AICc value: 4761.809 

The printed log ends here without showing the selected bandwidth. The lowest AICc in this trace is 4761.809, reached at bandwidths 2599.672 and 2597.255, compared with 4755.48 for the refined model above.

10.5 Conclusion

Removing the statistically non-significant variables (distance_to_primary_road and distance_to_secondary_road) from the gwLR model improves accuracy and specificity very slightly, while sensitivity drops slightly.
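
As a reminder of how the three metrics compared here are defined, the following base R sketch computes them from a confusion matrix. The `actual` and `predicted` vectors are hypothetical toy labels, not values from the Osun data, and treating TRUE as the positive class is an assumption made for illustration.

```r
# Hypothetical toy labels, NOT taken from the Osun water point data
actual    <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)

# Confusion matrix cells, with TRUE as the positive class (assumption)
tp <- sum(predicted & actual)     # true positives
tn <- sum(!predicted & !actual)   # true negatives
fp <- sum(predicted & !actual)    # false positives
fn <- sum(!predicted & actual)    # false negatives

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all points classified correctly
sensitivity <- tp / (tp + fn)                   # share of actual positives detected
specificity <- tn / (tn + fp)                   # share of actual negatives detected

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

Because sensitivity and specificity are computed on different subsets of the data, a change in the predictors can move them in opposite directions, which is the pattern reported above.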